Predicting sentiment from product reviews

Load GraphLab Create


In [3]:
import graphlab

Read some product review data

Loading reviews for a set of baby products.


In [4]:
products = graphlab.SFrame('amazon_baby.gl/')

Let's explore this data together

Data includes the product name, the review text and the rating of the review.


In [5]:
products.head(5)


Out[5]:
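+----------------------------------------------+--------------------------------------------------------+--------+
| name                                         | review                                                 | rating |
+----------------------------------------------+--------------------------------------------------------+--------+
| Planetwise Flannel Wipes                     | These flannel wipes are OK, but in my opinion ...      | 3.0    |
| Planetwise Wipe Pouch                        | it came early and was not disappointed. i love ...     | 5.0    |
| Annas Dream Full Quilt with 2 Shams ...      | Very soft and comfortable and warmer than it ...       | 5.0    |
| Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ...       | 5.0    |
| Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0    |
+----------------------------------------------+--------------------------------------------------------+--------+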
[5 rows x 3 columns]

Build the word count vector for each review from the products dataset


In [6]:
products['word_count'] = graphlab.text_analytics.count_words(products['review'])

In [7]:
products.head(5)


Out[7]:
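+----------------------------------------------+--------------------------------------------------------+--------+-----------------------------------------------------+
| name                                         | review                                                 | rating | word_count                                          |
+----------------------------------------------+--------------------------------------------------------+--------+-----------------------------------------------------+
| Planetwise Flannel Wipes                     | These flannel wipes are OK, but in my opinion ...      | 3.0    | {'and': 5, 'stink': 1, 'because': 1, 'ordered': ... |
| Planetwise Wipe Pouch                        | it came early and was not disappointed. i love ...     | 5.0    | {'and': 3, 'love': 1, 'it': 2, 'highly': 1, ...     |
| Annas Dream Full Quilt with 2 Shams ...      | Very soft and comfortable and warmer than it ...       | 5.0    | {'and': 2, 'quilt': 1, 'it': 1, 'comfortable': ...  |
| Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ...       | 5.0    | {'ingenious': 1, 'and': 3, 'love': 2, ...           |
| Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0    | {'and': 2, 'parents!!': 1, 'all': 2, 'puppet.': ... |
+----------------------------------------------+--------------------------------------------------------+--------+-----------------------------------------------------+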
[5 rows x 4 columns]
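
To make the word_count column concrete, here is a minimal pure-Python sketch of a bag-of-words count. It assumes lowercasing and whitespace splitting with punctuation left attached, which is consistent with keys such as 'parents!!' and 'puppet.' in the output above. The helper name count_words_sketch is hypothetical, and this is an approximation for illustration, not GraphLab's implementation.

```python
from collections import Counter


def count_words_sketch(text):
    """Approximate a bag-of-words count: lowercase, then split on whitespace.

    Punctuation stays attached to each token, so 'it.' and 'it' are
    counted as different words.
    """
    return dict(Counter(text.lower().split()))


counts = count_words_sketch("It came early and I love it. Love it!")
print(counts)
```

Because tokens keep their punctuation, 'it', 'it.' and 'it!' end up as three separate keys, which is one reason the full model later has so many features.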

Set the canvas target to 'ipynb' so that plots are displayed inline in the IPython Notebook


In [8]:
graphlab.canvas.set_target('ipynb')

In [9]:
products['name'].show()


Examining the reviews for the most-sold product: 'Vulli Sophie the Giraffe Teether'


In [10]:
giraffe_reviews = products[products['name'] == 'Vulli Sophie the Giraffe Teether']

In [11]:
len(giraffe_reviews)


Out[11]:
785

There are 785 reviews in total for the Vulli Sophie the Giraffe Teether in the Amazon dataset.


In [12]:
giraffe_reviews['rating'].show(view='Categorical')


Build a sentiment classifier


In [13]:
products['rating'].show(view='Categorical')


Define what's a positive and a negative sentiment

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment. Reviews with a rating of 4 or higher will be considered positive, while the ones with rating of 2 or lower will have a negative sentiment.


In [14]:
# Ignore all 3-star reviews
products = products[products['rating'] != 3]

In [15]:
# Positive sentiment = 4-star or 5-star reviews
products['sentiment'] = products['rating'] >= 4

In [16]:
products.head(6)


Out[16]:
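+----------------------------------------------+--------------------------------------------------------+--------+-----------------------------------------------------+-----------+
| name                                         | review                                                 | rating | word_count                                          | sentiment |
+----------------------------------------------+--------------------------------------------------------+--------+-----------------------------------------------------+-----------+
| Planetwise Wipe Pouch                        | it came early and was not disappointed. i love ...     | 5.0    | {'and': 3, 'love': 1, 'it': 2, 'highly': 1, ...     | 1         |
| Annas Dream Full Quilt with 2 Shams ...      | Very soft and comfortable and warmer than it ...       | 5.0    | {'and': 2, 'quilt': 1, 'it': 1, 'comfortable': ...  | 1         |
| Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ...       | 5.0    | {'ingenious': 1, 'and': 3, 'love': 2, ...           | 1         |
| Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0    | {'and': 2, 'parents!!': 1, 'all': 2, 'puppet.': ... | 1         |
| Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ...  | 5.0    | {'and': 2, 'cute': 1, 'help': 2, 'doll': 1, ...     | 1         |
| A Tale of Baby's Days with Peter Rabbit ...  | Lovely book, it's bound tightly so you may no ...      | 4.0    | {'shop': 1, 'be': 1, 'is': 1, 'it': 1, 'as': ...    | 1         |
+----------------------------------------------+--------------------------------------------------------+--------+-----------------------------------------------------+-----------+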
[6 rows x 5 columns]
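
The labelling rule described above can be sketched in plain Python on a handful of made-up ratings. The function name label_sentiment and the sample ratings are illustrative assumptions; the notebook itself applies the same rule with SFrame filtering and a comparison on the rating column.

```python
def label_sentiment(rating):
    """Return 1 for a positive review (rating >= 4), 0 for a negative one.

    A neutral rating of 3 returns None so that the caller can drop it.
    """
    if rating == 3:
        return None
    return 1 if rating >= 4 else 0


ratings = [5.0, 3.0, 1.0, 4.0, 2.0]  # made-up example ratings
labels = [label_sentiment(r) for r in ratings if label_sentiment(r) is not None]
print(labels)  # [1, 0, 1, 0]
```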

Let's train the sentiment classifier


In [17]:
train_data,test_data = products.random_split(.8, seed=0)
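
As a rough illustration of what a seeded 80/20 split does, the sketch below assigns each row to the first split with probability 0.8 using Python's random module. This is an assumption about the general idea only: random_split_sketch is a hypothetical helper, and GraphLab's own random_split may choose rows differently.

```python
import random


def random_split_sketch(rows, fraction, seed):
    """Assign each row to the first split with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(1000))
train, test = random_split_sketch(rows, 0.8, seed=0)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same two lists, which is why the notebook fixes seed=0: the training and test sets stay identical between runs.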

In [18]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                     target='sentiment',
                                                     features=['word_count'],
                                                     validation_set=test_data)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 219217
PROGRESS: Number of coefficients    : 219218
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 5        | 0.000002  | 26.055295    | 0.841481          | 0.839989            |
PROGRESS: | 2         | 9        | 3.000000  | 45.758070    | 0.947425          | 0.894877            |
PROGRESS: | 3         | 10       | 3.000000  | 53.068897    | 0.923768          | 0.866232            |
PROGRESS: | 4         | 11       | 3.000000  | 61.081645    | 0.971779          | 0.912743            |
PROGRESS: | 5         | 12       | 3.000000  | 70.424750    | 0.975511          | 0.908900            |
PROGRESS: | 6         | 13       | 3.000000  | 80.080783    | 0.899991          | 0.825967            |
PROGRESS: | 7         | 15       | 1.000000  | 95.266292    | 0.984548          | 0.921451            |
PROGRESS: | 8         | 16       | 1.000000  | 105.026024   | 0.985118          | 0.921871            |
PROGRESS: | 9         | 17       | 1.000000  | 115.184357   | 0.987066          | 0.919709            |
PROGRESS: | 10        | 18       | 1.000000  | 125.090439   | 0.988715          | 0.916256            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+

Evaluate the sentiment model


In [19]:
sentiment_model.evaluate(test_data, metric='roc_curve')


Out[19]:
{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +------------------+----------------+-----------------+-------+------+
 |    threshold     |      fpr       |       tpr       |   p   |  n   |
 +------------------+----------------+-----------------+-------+------+
 |       0.0        | 0.224474474474 | 0.0036102373463 | 27976 | 5328 |
 | 0.0010000000475  | 0.775525525526 |  0.996389762654 | 27976 | 5328 |
 | 0.00200000009499 | 0.735548048048 |  0.995353159851 | 27976 | 5328 |
 | 0.00300000002608 | 0.712837837838 |  0.99474549614  | 27976 | 5328 |
 | 0.00400000018999 | 0.69725975976  |  0.994280812125 | 27976 | 5328 |
 | 0.00499999988824 | 0.685998498498 |  0.993923362882 | 27976 | 5328 |
 | 0.00600000005215 | 0.676238738739 |  0.993422933943 | 27976 | 5328 |
 | 0.00700000021607 | 0.665728228228 |  0.993065484701 | 27976 | 5328 |
 | 0.00800000037998 | 0.655405405405 |  0.992815270232 | 27976 | 5328 |
 | 0.00899999961257 | 0.648085585586 |  0.992493565914 | 27976 | 5328 |
 +------------------+----------------+-----------------+-------+------+
 [1001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
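
Each row of the ROC table pairs a threshold with a false positive rate (fpr) and a true positive rate (tpr); p and n are the numbers of positive and negative examples. The sketch below shows one common way to compute a single ROC point on toy data. The function name roc_point, the toy labels and probabilities, and the choice of ">=" for the threshold comparison are assumptions made for illustration, not GraphLab's implementation.

```python
def roc_point(labels, probabilities, threshold):
    """Return (fpr, tpr) when predicting positive for probability >= threshold."""
    tp = fp = 0
    for label, prob in zip(labels, probabilities):
        if prob >= threshold:
            if label == 1:
                tp += 1
            else:
                fp += 1
    p = sum(1 for label in labels if label == 1)  # number of positive examples
    n = len(labels) - p                           # number of negative examples
    return fp / float(n), tp / float(p)


labels = [1, 1, 1, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.1]
for threshold in (0.0, 0.5, 1.0):
    print(threshold, roc_point(labels, probabilities, threshold))
```

Raising the threshold makes the classifier more conservative, so both rates fall from (1.0, 1.0) towards (0.0, 0.0) in this toy example.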

In [20]:
sentiment_model.show(view='Evaluation')


Applying the learned model to understand sentiment for the giraffe


In [21]:
giraffe_reviews['predicted_sentiment'] = sentiment_model.predict(giraffe_reviews, output_type='probability')

In [22]:
giraffe_reviews.head(4)


Out[22]:
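+--------------------------------------+--------------------------------------------------------+--------+---------------------------------------------------+---------------------+
| name                                 | review                                                 | rating | word_count                                        | predicted_sentiment |
+--------------------------------------+--------------------------------------------------------+--------+---------------------------------------------------+---------------------+
| Vulli Sophie the Giraffe Teether ... | He likes chewing on all the parts especially the ...   | 5.0    | {'and': 1, 'all': 1, 'because': 1, 'it': 1, ...   | 0.999513023521      |
| Vulli Sophie the Giraffe Teether ... | My son loves this toy and fits great in the diaper ... | 5.0    | {'and': 1, 'right': 1, 'help': 1, 'just': 1, ...  | 0.999320678306      |
| Vulli Sophie the Giraffe Teether ... | There really should be a large warning on the ...      | 1.0    | {'and': 2, 'all': 1, 'latex.': 1, 'being': 1, ... | 0.013558811687      |
| Vulli Sophie the Giraffe Teether ... | All the moms in my moms' group got Sophie for ...      | 5.0    | {'and': 2, 'one!': 1, 'all': 1, 'love': 1, ...    | 0.995769474148      |
+--------------------------------------+--------------------------------------------------------+--------+---------------------------------------------------+---------------------+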
[4 rows x 5 columns]

Sort the reviews based on the predicted sentiment and explore


In [23]:
giraffe_reviews = giraffe_reviews.sort('predicted_sentiment', ascending=False)

In [24]:
giraffe_reviews.head(5)


Out[24]:
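+--------------------------------------+--------------------------------------------------------+--------+----------------------------------------------------+---------------------+
| name                                 | review                                                 | rating | word_count                                         | predicted_sentiment |
+--------------------------------------+--------------------------------------------------------+--------+----------------------------------------------------+---------------------+
| Vulli Sophie the Giraffe Teether ... | Sophie, oh Sophie, your time has come. My ...          | 5.0    | {'giggles': 1, 'all': 1, "violet's": 2, 'food' ... | 1.0                 |
| Vulli Sophie the Giraffe Teether ... | I'm not sure why Sophie is such a hit with the ...     | 4.0    | {'peace': 1, 'month': 1, 'bright': 1, 'softer' ... | 0.999999999703      |
| Vulli Sophie the Giraffe Teether ... | I'll be honest...I bought this toy because all the ... | 4.0    | {'all': 2, 'pops': 1, 'existence.': 1, ...         | 0.999999999392      |
| Vulli Sophie the Giraffe Teether ... | We got this little giraffe as a gift from a ...        | 5.0    | {'all': 2, "don't": 1, '(literally).so': 1, ...    | 0.99999999919       |
| Vulli Sophie the Giraffe Teether ... | As a mother of 16month old twins; I bought ...         | 5.0    | {'cute': 1, 'all': 1, 'reviews.': 2, 'just' ...    | 0.999999998657      |
+--------------------------------------+--------------------------------------------------------+--------+----------------------------------------------------+---------------------+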
[5 rows x 5 columns]

Most positive reviews for the giraffe


In [25]:
giraffe_reviews[0]['review']

In [26]:
giraffe_reviews[1]['review']

Show the most negative reviews for the giraffe


In [28]:
giraffe_reviews[-1]['review']


Out[28]:
"My son (now 2.5) LOVED his Sophie, and I bought one for every baby shower I've gone to. Now, my daughter (6 months) just today nearly choked on it and I will never give it to her again. Had I not been within hearing range it could have been fatal. The strange sound she was making caught my attention and when I went to her and found the front curved leg shoved well down her throat and her face a purply/blue I panicked. I pulled it out and she vomited all over the carpet before screaming her head off. I can't believe how my opinion of this toy has changed from a must-have to a must-not-use. Please don't disregard any of the choking hazard comments, they are not over exaggerated!"

In [29]:
giraffe_reviews[-2]['review']


Out[29]:
"This children's toy is nostalgic and very cute. However, there is a distinct rubber smell and a very odd taste, yes I tried it, that my baby did not enjoy. Also, if it is soiled it is extremely difficult to clean as the rubber is a kind of porus material and does not clean well. The final thing is the squeaking device inside which stopped working after the first couple of days. I returned this item feeling I had overpaid for a toy that was defective and did not meet my expectations. Please do not be swayed by the cute packaging and hype surounding it as I was. One more thing, I was given a full refund from Amazon without any problem."

Use a small set of selected words as features

We used the word counts for all words in the reviews to train the sentiment classifier. Now we follow a similar path, but use only the following subset of words:


In [30]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing','love', 'horrible', 
                  'bad', 'terrible','awful', 'wow', 'hate']

In [31]:
Selected_Frame = graphlab.SArray(selected_words)

In [32]:
Selected_Frame


Out[32]:
dtype: str
Rows: 11
['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

Now we keep only the words from each review that appear in selected_words. To do that, we trim each review's word-count dictionary down to those keys.


In [33]:
bow = graphlab.text_analytics.count_words(products['review'])

In [34]:
# Keep only the word counts whose key appears in Selected_Frame (exclude=False)
# and store the trimmed dictionaries in a new column, words_clean
products['words_clean'] = bow.dict_trim_by_keys(Selected_Frame, exclude=False)

In [35]:
# Drop the old word_count column by selecting only the columns we still need
products = products[['name', 'review', 'rating', 'sentiment', 'words_clean']]

In [36]:
products.head(5)


Out[36]:
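+----------------------------------------------+--------------------------------------------------------+--------+-----------+--------------+
| name                                         | review                                                 | rating | sentiment | words_clean  |
+----------------------------------------------+--------------------------------------------------------+--------+-----------+--------------+
| Planetwise Wipe Pouch                        | it came early and was not disappointed. i love ...     | 5.0    | 1         | {'love': 1}  |
| Annas Dream Full Quilt with 2 Shams ...      | Very soft and comfortable and warmer than it ...       | 5.0    | 1         | {}           |
| Stop Pacifier Sucking without tears with ... | This is a product well worth the purchase. I ...       | 5.0    | 1         | {'love': 2}  |
| Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0    | 1         | {'great': 1} |
| Stop Pacifier Sucking without tears with ... | When the Binky Fairy came to our house, we didn't ...  | 5.0    | 1         | {'great': 1} |
+----------------------------------------------+--------------------------------------------------------+--------+-----------+--------------+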
[5 rows x 5 columns]
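
The trimming step can be pictured as a dictionary filter. The sketch below is a plain-Python stand-in for illustration: trim_by_keys is a hypothetical helper, and the input dictionaries contain only the entries visible in the truncated output above, so they are partial examples rather than the full word counts.

```python
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible',
                  'bad', 'terrible', 'awful', 'wow', 'hate']


def trim_by_keys(word_count, keys):
    """Keep only the entries of `word_count` whose key is in `keys`."""
    wanted = set(keys)
    return {word: count for word, count in word_count.items() if word in wanted}


with_match = trim_by_keys({'and': 3, 'love': 1, 'it': 2, 'highly': 1}, selected_words)
without_match = trim_by_keys({'and': 2, 'quilt': 1, 'it': 1}, selected_words)
print(with_match)     # {'love': 1}
print(without_match)  # {}
```

A review that contains none of the selected words ends up with an empty dictionary, as in the second row of the table above.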

Let's train a model on the trimmed word counts (words_clean)


In [37]:
train_data_clean,test_data_clean = products.random_split(.8, seed=0)

In [38]:
sentiment_model_clean = graphlab.logistic_classifier.create(train_data_clean,
                                                     target='sentiment',
                                                     features=['words_clean'],
                                                     validation_set=test_data_clean)


PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 133448
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 11
PROGRESS: Number of coefficients    : 12
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 1.771026     | 0.844299          | 0.842842            |
PROGRESS: | 2         | 3        | 3.105178     | 0.844186          | 0.842842            |
PROGRESS: | 3         | 4        | 4.117828     | 0.844276          | 0.843142            |
PROGRESS: | 4         | 5        | 5.361739     | 0.844269          | 0.843142            |
PROGRESS: | 5         | 6        | 6.693654     | 0.844269          | 0.843142            |
PROGRESS: | 6         | 7        | 7.696012     | 0.844269          | 0.843142            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+

Evaluate the model


In [39]:
sentiment_model_clean.evaluate(test_data_clean, metric='roc_curve')


Out[39]:
{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +------------------+-------------------+-----+-------+------+
 |    threshold     |        fpr        | tpr |   p   |  n   |
 +------------------+-------------------+-----+-------+------+
 |       0.0        | 0.000187406296852 | 0.0 | 27969 | 5336 |
 | 0.0010000000475  |   0.999812593703  | 1.0 | 27969 | 5336 |
 | 0.00200000009499 |   0.999625187406  | 1.0 | 27969 | 5336 |
 | 0.00300000002608 |   0.999625187406  | 1.0 | 27969 | 5336 |
 | 0.00400000018999 |   0.999437781109  | 1.0 | 27969 | 5336 |
 | 0.00499999988824 |   0.999437781109  | 1.0 | 27969 | 5336 |
 | 0.00600000005215 |   0.999250374813  | 1.0 | 27969 | 5336 |
 | 0.00700000021607 |   0.999250374813  | 1.0 | 27969 | 5336 |
 | 0.00800000037998 |   0.999250374813  | 1.0 | 27969 | 5336 |
 | 0.00899999961257 |   0.999250374813  | 1.0 | 27969 | 5336 |
 +------------------+-------------------+-----+-------+------+
 [1001 rows x 5 columns]
 Note: Only the head of the SFrame is printed.
 You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
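
For context, the validation accuracy of about 0.843 in the training log can be compared with a simple majority-class baseline, computed from the p and n counts printed in the ROC table above. This comparison is an addition for illustration and not part of the original analysis; majority_class_accuracy is a hypothetical helper.

```python
def majority_class_accuracy(num_positive, num_negative):
    """Accuracy of always predicting the more frequent class."""
    total = num_positive + num_negative
    return max(num_positive, num_negative) / float(total)


# p and n as printed in the ROC table for test_data_clean
baseline = majority_class_accuracy(27969, 5336)
print(round(baseline, 4))
```

The baseline comes out at roughly 0.84, so on this test set the selected-words model is only slightly more accurate than always predicting a positive review.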

In [40]:
sentiment_model_clean.show(view='Evaluation')


Applying the selected-words model to understand sentiment for the giraffe


In [41]:
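# Note: giraffe_reviews was created in In [10], before the words_clean column existed,
# so it carries word_count but no words_clean column.
giraffe_reviews['predicted_sentiment'] = sentiment_model_clean.predict(giraffe_reviews, output_type='probability')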

In [42]:
giraffe_reviews.head(4)


Out[42]:
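+--------------------------------------+--------------------------------------------------------+--------+----------------------------------------------------+---------------------+
| name                                 | review                                                 | rating | word_count                                         | predicted_sentiment |
+--------------------------------------+--------------------------------------------------------+--------+----------------------------------------------------+---------------------+
| Vulli Sophie the Giraffe Teether ... | Sophie, oh Sophie, your time has come. My ...          | 5.0    | {'giggles': 1, 'all': 1, "violet's": 2, 'food' ... | 0.796940851291      |
| Vulli Sophie the Giraffe Teether ... | I'm not sure why Sophie is such a hit with the ...     | 4.0    | {'peace': 1, 'month': 1, 'bright': 1, 'softer' ... | 0.796940851291      |
| Vulli Sophie the Giraffe Teether ... | I'll be honest...I bought this toy because all the ... | 4.0    | {'all': 2, 'pops': 1, 'existence.': 1, ...         | 0.796940851291      |
| Vulli Sophie the Giraffe Teether ... | We got this little giraffe as a gift from a ...        | 5.0    | {'all': 2, "don't": 1, '(literally).so': 1, ...    | 0.796940851291      |
+--------------------------------------+--------------------------------------------------------+--------+----------------------------------------------------+---------------------+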
[4 rows x 5 columns]

Sort the reviews based on predicted sentiment


In [43]:
giraffe_reviews = giraffe_reviews.sort('predicted_sentiment', ascending=False)

In [44]:
giraffe_reviews.head(4)


Out[44]:
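+--------------------------------------+-----------------------------------------------------+--------+-----------------------------------------------------+---------------------+
| name                                 | review                                              | rating | word_count                                          | predicted_sentiment |
+--------------------------------------+-----------------------------------------------------+--------+-----------------------------------------------------+---------------------+
| Vulli Sophie the Giraffe Teether ... | My 3 month old holds it good and soft enough to ... | 5.0    | {'and': 2, 'good': 1, 'old': 1, 'loves': 1, ...     | 0.796940851291      |
| Vulli Sophie the Giraffe Teether ... | The baby loves chewing on it and is able to ...     | 5.0    | {'and': 1, 'maneuver': 1, 'around': 1, 'on': 1, ... | 0.796940851291      |
| Vulli Sophie the Giraffe Teether ... | My son at 5 months already has his two ...          | 5.0    | {'and': 2, 'chew': 2, 'already': 1, 'throwi ...     | 0.796940851291      |
| Vulli Sophie the Giraffe Teether ... | Like every other parent out there that has ta ...   | 5.0    | {'and': 1, 'it.': 1, 'talked': 1, 'too.': 1, ...    | 0.796940851291      |
+--------------------------------------+-----------------------------------------------------+--------+-----------------------------------------------------+---------------------+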
[4 rows x 5 columns]
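
Every predicted probability shown for the selected-words model is the same value, 0.796940851291. One plausible explanation, offered as an assumption rather than confirmed GraphLab behaviour, is that giraffe_reviews has no words_clean column (it was created before that column was added), so the model scores each review with its intercept alone. The quick check below applies the logistic function to the intercept listed for sentiment_model_clean in the coefficient table at the end of this notebook.

```python
import math


def sigmoid(score):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


intercept = 1.36728315229  # (intercept) row of sentiment_model_clean['coefficients']
probability = sigmoid(intercept)
print(round(probability, 4))
```

The result agrees with the repeated value to several decimal places. If the explanation holds, one possible remedy (not run here) would be to add a words_clean column to giraffe_reviews, trimmed from its word_count column, before calling predict.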


In [45]:
sentiment_model['coefficients']


Out[45]:
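+-------------+---------------+-------+------------------+
| name        | index         | class | value            |
+-------------+---------------+-------+------------------+
| (intercept) | None          | 1     | 0.729182482603   |
| word_count  | it.           | 1     | 0.0923459975112  |
| word_count  | recommend     | 1     | 0.351653944839   |
| word_count  | love          | 1     | 0.824676597257   |
| word_count  | it            | 1     | 0.00340245508889 |
| word_count  | disappointed. | 1     | -2.66907012284   |
| word_count  | planet        | 1     | -0.28318516271   |
| word_count  | and           | 1     | 0.0387848304637  |
| word_count  | bags          | 1     | 0.132287521499   |
| word_count  | wipes         | 1     | -0.0146873544927 |
+-------------+---------------+-------+------------------+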
[219218 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [46]:
sentiment_model_clean['coefficients']


Out[46]:
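+-------------+-----------+-------+------------------+
| name        | index     | class | value            |
+-------------+-----------+-------+------------------+
| (intercept) | None      | 1     | 1.36728315229    |
| words_clean | love      | 1     | 1.39989834302    |
| words_clean | great     | 1     | 0.883937894898   |
| words_clean | amazing   | 1     | 0.892802422508   |
| words_clean | fantastic | 1     | 0.891303090304   |
| words_clean | terrible  | 1     | -2.09049998487   |
| words_clean | bad       | 1     | -0.985827369929  |
| words_clean | awesome   | 1     | 1.05800888878    |
| words_clean | wow       | 1     | -0.0541450123332 |
| words_clean | hate      | 1     | -1.40916406276   |
+-------------+-----------+-------+------------------+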
[12 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
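
To see how these coefficients turn selected-word counts into a probability, the sketch below scores a few hand-made dictionaries. It assumes the standard logistic regression form, probability = sigmoid(intercept + sum of coefficient * count), and uses only the rows printed above; the coefficients for 'horrible' and 'awful' are not shown in the output, so they are left out. The names INTERCEPT, COEFFICIENTS and predict_probability are hypothetical.

```python
import math

# Values copied from the printed head of sentiment_model_clean['coefficients'].
# Only 10 of the 12 rows are printed, so 'horrible' and 'awful' are missing here.
INTERCEPT = 1.36728315229
COEFFICIENTS = {
    'love': 1.39989834302,
    'great': 0.883937894898,
    'amazing': 0.892802422508,
    'fantastic': 0.891303090304,
    'terrible': -2.09049998487,
    'bad': -0.985827369929,
    'awesome': 1.05800888878,
    'wow': -0.0541450123332,
    'hate': -1.40916406276,
}


def predict_probability(words_clean):
    """Score a trimmed word-count dict with the logistic model sketched above."""
    # Words without a known coefficient contribute nothing in this sketch.
    score = INTERCEPT + sum(COEFFICIENTS.get(word, 0.0) * count
                            for word, count in words_clean.items())
    return 1.0 / (1.0 + math.exp(-score))


p_love = predict_probability({'love': 1})
p_empty = predict_probability({})
p_terrible = predict_probability({'terrible': 1})
print(round(p_love, 3), round(p_empty, 3), round(p_terrible, 3))
```

A single 'love' pushes the probability well above the intercept-only value, while a single 'terrible' pulls it below 0.5, which matches the signs of the coefficients in the table.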
